Short vs. long RNA-seq (differentiation over timecourse)
Load packages
Load data
The data is loaded from RDS-objects that have been created with the import_processing.qmd file. For each of the 4 NDR cutoffs (0.025, 0.05, 0.1, 0.2), two data frames exist containing the expression data (counts) of the two different technologies used to obtain the data:
- Illumina sequencing and Salmon quantification (short read)
- PacBio sequencing and Bambu quantification (long read)
In total, this adds up to 8 different data frames. Each of these data frames contains 11 samples, which are replicates from one of five different time points.
bambu_0.025_gene <- df_list_bambu$bambu_0.025_gene
bambu_0.05_gene <- df_list_bambu$bambu_0.05_gene
bambu_0.1_gene <- df_list_bambu$bambu_0.1_gene
bambu_0.2_gene <- df_list_bambu$bambu_0.2_gene
salmon_0.025_gene <- df_list_salmon$salmon_0.025_gene
salmon_0.05_gene <- df_list_salmon$salmon_0.05_gene
salmon_0.1_gene <- df_list_salmon$salmon_0.1_gene
salmon_0.2_gene <- df_list_salmon$salmon_0.2_gene
(meta_samples <- df_list_meta$metadata) sampleID condition time
1 1 day0.rep1 0
2 2 day0.rep2 0
3 3 day0.rep3 0
4 4 day1.rep1 1
5 5 day2.rep1 2
6 6 day3.rep1 3
7 7 day3.rep2 3
8 8 day4.rep1 4
9 9 day5.rep1 5
10 10 day5.rep2 5
11 11 day5.rep3 5
Create DGEList objects and filter expressed genes
The expression data is then filtered using the function filterByExpr to exclude genes that do not show high enough expression or do not show expression in enough samples.
0.025
y_salmon_0.025 <- DGEList(counts = salmon_0.025_gene, group = day)
y_bambu_0.025 <- DGEList(counts = bambu_0.025_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.025)
expressed_bambu <- filterByExpr(y_bambu_0.025)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.025 <-
y_salmon_0.025[-which(rownames(salmon_0.025_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.025 <-
y_bambu_0.025[-which(rownames(bambu_0.025_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}0.05
y_salmon_0.05 <- DGEList(counts = salmon_0.05_gene, group = day)
y_bambu_0.05 <- DGEList(counts = bambu_0.05_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.05)
expressed_bambu <- filterByExpr(y_bambu_0.05)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.05 <-
y_salmon_0.05[-which(rownames(salmon_0.05_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.05 <-
y_bambu_0.05[-which(rownames(bambu_0.05_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}0.1
y_salmon_0.1 <- DGEList(counts = salmon_0.1_gene, group = day)
y_bambu_0.1 <- DGEList(counts = bambu_0.1_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.1)
expressed_bambu <- filterByExpr(y_bambu_0.1)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.1 <-
y_salmon_0.1[-which(rownames(salmon_0.1_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.1 <-
y_bambu_0.1[-which(rownames(bambu_0.1_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}0.2
y_salmon_0.2 <- DGEList(counts = salmon_0.2_gene, group = day)
y_bambu_0.2 <- DGEList(counts = bambu_0.2_gene, group = day)
expressed_salmon <- filterByExpr(y_salmon_0.2)
expressed_bambu <- filterByExpr(y_bambu_0.2)
not_expressed <-
intersect(names(expressed_salmon[!expressed_salmon]), names(expressed_bambu[!expressed_bambu]))
if (length(not_expressed) > 0) {
y_salmon_0.2 <-
y_salmon_0.2[-which(rownames(salmon_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
y_bambu_0.2 <-
y_bambu_0.2[-which(rownames(bambu_0.2_gene) %in% not_expressed), , keep.lib.sizes = FALSE]
}Differential gene expression analysis
The differential gene expression analysis aims to identify genes that are differentially expressed over the time course. These results are computed for all 4 NDR cufoffs of both technologies, and are then overlayed for each NDR cutoff to compare the two technologies, so the long read and short read sequencing.
The design matrix is created in a way that the first time point, day0, is used as a baseline (intercept of the model).
(Intercept) day1 day2 day3 day4 day5
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 1 1 0 0 0 0
5 1 0 1 0 0 0
6 1 0 0 1 0 0
7 1 0 0 1 0 0
8 1 0 0 0 1 0
9 1 0 0 0 0 1
10 1 0 0 0 0 1
11 1 0 0 0 0 1
attr(,"assign")
[1] 0 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$day
[1] "contr.treatment"
0.025
The first NDR cutoff is 0.025. The differential expression analysis is performed with edgeR, and genes are considered as differentially expressed if the adjusted p-value of the log-likelihood test between the full and the reduced model is below 0.01. The p-value adjustment is performed using the Benjamini-Hochberg method.
# Salmon 0.025
y_salmon_0.025 <- calcNormFactors(y_salmon_0.025)
## estimate dispersion
y_salmon_0.025 <- estimateDisp(y_salmon_0.025, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.025, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.025 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.025) day5-day4-day3-day2-day1
NotSig 17374
Sig 7343
# Bambu 0.025
y_bambu_0.025 <- calcNormFactors(y_bambu_0.025)
## estimate dispersion
y_bambu_0.025 <- estimateDisp(y_bambu_0.025, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.025, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.025 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.025) day5-day4-day3-day2-day1
NotSig 16879
Sig 7838
For Illumina+Salmon, out of the 24717 tested genes, 7343 were identified as differentially expressed. In the case of PacBio+Bambu, 7838 genes were identified as differentially expressed. Both methods identify roughly the same amount of genes as differentially expressed.
# overlay results of both technologies
de_genes_0.025 <- as.data.frame(cbind(decide_dif_s0.025, decide_dif_b0.025))
colnames(de_genes_0.025) <- c("Salmon_0.025", "Bambu_0.025")
upset(de_genes_0.025, sets = colnames(de_genes_0.025))From the UpSet plot above it is visible that, even though identifying roughly the same number of genes as differentially expressed, the two technologies have same differences in their results. The majority of genes identified overlap, however, there is still are still some genes only called differentially expressed by either Illumina+Salmon or by PacBio+Bambu. To further investigate the similarities/differences, the logFC and p-values of the analysis was compared.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.025)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.025)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of the p-values after adjustment shows a similar trend in both technologies, but also same differences showing that the results differ. The same is observable for the logFC values. Additionally, it is visible that many genes that show no logFC in the analysis of one technology shows a (high) logFC in the other one, which agrees with the observation in the UpSet plot, that some genes are called differentially expressed by either of the two technologies, but not by both.
After performing the differential gene expression analysis over the whole time course, the first and the last time points were also compared as discrete conditions to identify up- or down-regulated genes only comparing the begin and the end of the experiment.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.025, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.025_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.025_tp) day5
Down 2377
NotSig 18283
Up 4057
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.025_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.025 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.025, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.025_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.025_tp) day5
Down 2014
NotSig 19559
Up 3144
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.025_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.025 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The analysis identifies 2377 down-regulated and 4057 up-regulated genes for Illumian+Salmon and 2014 down-regulated and 3144 up-regulated genes for PacBio+Bambu. Again, 24717 genes were tested and both technologies show similar results.
0.05
The next NDR cutoff is 0.05. The same analysis steps that were performed for the NDR cutoff of 0.025 were performed here.
# Salmon 0.05
y_salmon_0.05 <- calcNormFactors(y_salmon_0.05)
## estimate dispersion
y_salmon_0.05 <- estimateDisp(y_salmon_0.05, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.05, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.05 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.05) day5-day4-day3-day2-day1
NotSig 17444
Sig 7367
# Bambu 0.05
y_bambu_0.05 <- calcNormFactors(y_bambu_0.05)
## estimate dispersion
y_bambu_0.05 <- estimateDisp(y_bambu_0.05, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.05, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.05 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.05) day5-day4-day3-day2-day1
NotSig 16927
Sig 7884
For Illumina+Salmon, out of the 24811 tested genes, 7367 were identified as differentially expressed. In the case of PacBio+Bambu, 7884 genes were identified as differentially expressed. Both methods identify roughly the same amount of genes as differentially expressed.
# overlay results of both technologies
de_genes_0.05 <- as.data.frame(cbind(decide_dif_s0.05, decide_dif_b0.05))
colnames(de_genes_0.05) <- c("Salmon_0.05", "Bambu_0.05")
upset(de_genes_0.05, sets = colnames(de_genes_0.05))The UpSet plot shows similar results compared to the NDR cutoff of 0.025. Again, the majority of genes are detected by both methods, however, some are deteced by either of the two methods.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.05)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.05)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of adjusted p-values and logFC values shows the same results as seen when performing the comparison on the results of the 0.025 NDR cutoff.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.05, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.05_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.05_tp) day5
Down 2388
NotSig 18352
Up 4071
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.05_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.05 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.05, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.05_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.05_tp) day5
Down 2031
NotSig 19608
Up 3172
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.05_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.05 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The differential expression analysis of day5 and day0 identifies 2388 down-regulated and 4071 up-regulated genes for Illumian+Salmon and 2031 down-regulated and 3172 up-regulated genes for PacBio+Bambu. Again, 24811 genes were tested and both technologies show similar results.
0.1
The next NDR cutoff is 0.1. The same analysis steps that were performed for the other NDR cutoffs were performed here as well.
# Salmon 0.1
y_salmon_0.1 <- calcNormFactors(y_salmon_0.1)
## estimate dispersion
y_salmon_0.1 <- estimateDisp(y_salmon_0.1, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.1, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_s0.1 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.1) day5-day4-day3-day2-day1
NotSig 17575
Sig 7529
# Bambu 0.1
y_bambu_0.1 <- calcNormFactors(y_bambu_0.1)
## estimate dispersion
y_bambu_0.1 <- estimateDisp(y_bambu_0.1, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.1, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.1 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.1) day5-day4-day3-day2-day1
NotSig 17056
Sig 8048
The differential expression analysis of the whole time course identified 7529 genes as differentially expressed for Illumina+Salmon and 8048 for PacBio+Bambu. 25104 genes were tested.
# overlay results of both technologies
de_genes_0.1 <- as.data.frame(cbind(decide_dif_s0.1, decide_dif_b0.1))
colnames(de_genes_0.1) <- c("Salmon_0.1", "Bambu_0.1")
upset(de_genes_0.1, sets = colnames(de_genes_0.1))When overlaying the results of the two technologies, it is again visible that most genes are detected by both methods, while some are deteced exclusively by one of them.
# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.1)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 1 rows containing non-finite values (`stat_smooth()`).
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.1)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The comparison of adjusted p-values and logFC values shows the same trends as seen for the other NDR cutoffs.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.1, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.1_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.1_tp) day5
Down 2467
NotSig 18544
Up 4093
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.1_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.1 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.1, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.1_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.1_tp) day5
Down 2093
NotSig 19746
Up 3265
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.1_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.1 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))The time point differential expression analysis (day5 and day0) identified 2467 down-regulated and 4093 up-regulated genes for Illumina+Salmon and 2093 down-regulated and 3265 up-regulated genes for PacBio+Bambu. In total, 25104 were tested.
0.2
The last NDR cutoff is 0.2. Again, the same steps are performed.
# Salmon 0.2
y_salmon_0.2 <- calcNormFactors(y_salmon_0.2)
## estimate dispersion
y_salmon_0.2 <- estimateDisp(y_salmon_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_salmon_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtS <- glmLRT(fit, coef = 2:6)
## save lrt for pathway analysis
lrt_pathway <- lrtS
#topTags(lrt)
## multiple testing correction
decide_dif_s0.2 <-
decideTests.DGELRT(lrtS,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_s0.2) day5-day4-day3-day2-day1
NotSig 18289
Sig 7723
# Bambu 0.2
y_bambu_0.2 <- calcNormFactors(y_bambu_0.2)
## estimate dispersion
y_bambu_0.2 <- estimateDisp(y_bambu_0.2, design)
## fit the negative binomial model
fit <- glmFit(y_bambu_0.2, design)
## conduct genewise statistical tests for a given coefficient contrast (coef = row of design matrix)
lrtB <- glmLRT(fit, coef = 2:6)
#topTags(lrt)
## multiple testing correction
decide_dif_b0.2 <-
decideTests.DGELRT(lrtB,
adjust.method = "BH",
p.value = 0.01)
summary(decide_dif_b0.2) day5-day4-day3-day2-day1
NotSig 17667
Sig 8345
The whole time course differential expression analysis identified 7723 deferentially expressed genes for Illumina+Salmon and 8345 for PacBio+Bambu. For the NDR cutoff of 0.2, 26012 genes were tested.
# overlay results of both technologies
de_genes_0.2 <- as.data.frame(cbind(decide_dif_s0.2, decide_dif_b0.2))
colnames(de_genes_0.2) <- c("Salmon_0.2", "Bambu_0.2")
upset(de_genes_0.2, sets = colnames(de_genes_0.2))# comparison of padj and logfc
padjS <- p.adjust(lrtS$table$PValue, method = "BH")
padjB <- p.adjust(lrtB$table$PValue, method = "BH")
padj <- data.frame(Salmon = padjS, Bambu = padjB)
logfcS <- lrtS$table$logFC.day5
logfcB <- lrtB$table$logFC.day5
logfc <- data.frame(Salmon = logfcS, Bambu = logfcB)
ggplot(data = padj, mapping = aes(x = -log10(Salmon), y = -log10(Bambu))) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of P-values (NDR=0.2)", x = "-log10 P-value (Salmon)", y = "-log10 P-value (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(data = logfc, mapping = aes(x = Salmon, y = Bambu)) +
geom_point() +
geom_smooth() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_viridis() +
labs(title = "Comparison of LogFC (NDR=0.2)", x = "LogFC (Salmon)", y = "LogFC (Bambu)")`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The overlay of the results and the logFC/p-value comparsion showed similar results compared to the other NDR cutoffs.
# time point comparison
## Salmon
fit <- glmFit(y_salmon_0.2, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_s0.2_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_s0.2_tp) day5
Down 2556
NotSig 19230
Up 4226
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_s0.2_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Salmon 0.2 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))## Bambu
fit <- glmFit(y_bambu_0.2, design)
lrt <- glmLRT(fit, coef = 6)
decide_dif_b0.2_tp <-
decideTests.DGELRT(lrt,
adjust.method = "BH",
p.value = 0.01,
lfc = log2(2))
summary(decide_dif_b0.2_tp) day5
Down 2193
NotSig 20389
Up 3430
ggplot(
data = as.data.frame(lrt$table),
mapping = aes(
x = logFC,
y = -log10(p.adjust(PValue, method = "BH")),
color = as.factor(decide_dif_b0.2_tp)
)
) +
geom_point() +
geom_vline(xintercept = log2(2), linetype = "dashed") +
geom_vline(xintercept = -log2(2), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
labs(
title = "Differential gene expression Bambu 0.2 Day5-Day0",
x = "LogFC",
y = "-10log(p-value)",
color = "DE"
) +
scale_color_viridis_d() +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))For the time point comparison, 2467 down-regulated and 4093 up-regulated genes for Illumina+Salmon and 2093 down-regulated and 3265 up-regulated genes for PacBio+Bambu were identified. In total, 26012 were tested.
The differential gene expression analysis mainly showed two things. First, it showed that the results of the different NDR cutoff rates are very similar, the main difference here is that more genes are in the data in the first place from the quantification. Second, it showed that the results of the two different technologies, Illumina+Salmon and PacBio+Bambu, are quite similar, but also show some differences. When treating the Illumina+Salmon data as the ground truth, it can be said that PacBio+Bambu is pretty close to the truth, but still shows some deviations.
Following a differential gene expression analysis, the identified genes can be grouped into gene sets of similar genes and a function or pathways analysis can be conducted using these gene sets. To do so, we decided to use only the genes that were detected by the analyses of both technologies, so the overlap which can be seen in the UpSet plot. Since the results of the different NDR cutoffs are very similar, we also decided to use only one of the cutoffs, namely the NDR cutoff of 0.2. This is the highest cutoff and therefore the data includes the most transcripts/genes.
Grouping of DE genes
To group the differentially expressed genes, the average expression of each gene at each time point was calculated first.
# create a list of the genes called DE by both methods
s <- which(de_genes_0.2$Salmon_0.2 == 1)
b <- which(de_genes_0.2$Bambu_0.2 == 1)
s_b <- intersect(s, b)
DEG <- rownames(de_genes_0.2[s_b,])
# calculate the average expression of each gene at each time point (average of the replicates)
avg_expr <- as.data.frame(sapply(sort(unique(meta_samples$time))[c(1,4,6)], function(time) {
rowMeans(salmon_0.2_gene[, which(meta_samples$time == time)])
}))
# for time points 1, 2 and 4, only one replicate exists
avg_expr$V4 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 1)]))
avg_expr$V5 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 2)]))
avg_expr$V6 <- unlist(as.vector(salmon_0.2_gene[, which(meta_samples$time == 4)]))
avg_expr <- avg_expr[, c(1,4,5,2,6,3)]
colnames(avg_expr) <- sort(unique(meta_samples$time))The genes are then clustered using the their correlation as a distance measure (to be more exact: 1-correlation). The cluster dendrogram of the hierarchical clustering as well as a heat map of the correlation including the dendrogram are shown below.
corr_DEG <- cor(t(avg_expr[DEG,]), method = "spearman")
hcl_DEG <- hclust(as.dist(1 - corr_DEG), method = "complete")
plot(hcl_DEG, label = FALSE)cl_DEG <- cutree(hcl_DEG, k = 4)
heatmap.2(corr_DEG, Rowv = as.dendrogram(hcl_DEG), Colv = as.dendrogram(hcl_DEG),
trace = "none", scale = "none", labRow = NA, labCol = NA, col = viridis,
ColSideColors = rainbow(15)[cl_DEG])The genes were then split into 4 clusters by visual inspection, which can be seen in the plot above. The bar indicates the 4 clusters by different colors (yellow, red, green and orange). The average expression of the genes of the 4 clusters at each time point are shown below:
avg_expr_DEG_list <- tapply(names(cl_DEG), cl_DEG, function(x) avg_expr[x,])
scaled_expr_DEG_list <- lapply(avg_expr_DEG_list, function(x) t(scale(t(x))))
layout(matrix(1:4, nrow = 2, byrow = T))
for(cl in 1:4)
boxplot(scaled_expr_DEG_list[[cl]],
main = paste0(cl, " (", nrow(scaled_expr_DEG_list[[cl]]), ")"))The expression pattern of cluster 1 seems to follow a downwards trend, meaning that cluster 1 might contain mostly genes that are down-regulated over the time course. Clusters 2 and 3 show an upwards trend in expression, so these clusters might contain mostly genes that are up-regulated over the time course. Cluster 4 does not follow an apparent trend, there are many outliers visible in at all time points and the hight of the boxes seem to fluctuate rather than following a trend. Moreover, this cluster is also the smallest one.
For the following function and pathway analysis, the 4 clusters were used as gene sets.
Function/Pathway analysis
The pathway analysis was performed using two different methods. First, the method DAVID was used, which is a frequency-based method. Second, the method GSEA was used, which is a ranked-based method.
DAVID (frequency-based)
To perform DAVID on the 4 gene sets, the ensemble gene ids of the genes must be extracted and then written into text files, which are uploaded to the DAVID web tool. Since same of the newly discovered transcripts could not be mapped to an existing gene, there are some genes in the gene sets that do not have an ensemble gene id compatible with DAVID. These genes were therefore removed from the analysis. As explained in the methods part, the results of DAVID depend on the background gene set which is used during the analysis. Here, the background list contained all genes of the 0.2 NDR cutoff data after filtering, with the expection of newly identified genes.
get_ensemble_id <- function(input_string) {
ensemble_id_version <- unlist(strsplit(input_string, "\\."))
return(ensemble_id_version[1])
}
input <- names(which(cl_DEG == 1))
names_cluster1 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 2))
names_cluster2 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 3))
names_cluster3 <- unname(sapply(input, get_ensemble_id))
input <- names(which(cl_DEG == 4))
names_cluster4 <- unname(sapply(input, get_ensemble_id))
input <- rownames(y_salmon_0.2)
background <- unname(sapply(input, get_ensemble_id))
write.table(
names_cluster1[-which(startsWith(names_cluster1, "Bambu"))],
file = "../genes_C1.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster2,
file = "../genes_C2.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster3,
file = "../genes_C3.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
names_cluster4,
file = "../genes_C4.txt",
quote = F,
row.names = F,
col.names = F
)
write.table(
background[-which(startsWith(background, "Bambu"))],
file = "../genes_expressed.txt",
quote = F,
row.names = F,
col.names = F
)The results were downloaded from the DAVID web tool as text files and contain terms describing the functions or pathways different genes are involved in. The results also contain other measures, like the count of genes of the input gene set that are associated with a specific term, as well as adjusted p-values showing the significance of this term in the input gene set. The results were then filtered by the adjusted p-values. The adjustment was conducted using the Benjamini-Hochberg method. Following, the results were sorted by their adjusted p-values in ascending order.
# import DAVID results
cluster1 <- read.csv("../DAVID/C1_expressed_background.txt", sep = "\t")
cluster2 <- read.csv("../DAVID/C2_expressed_background.txt", sep = "\t")
cluster3 <- read.csv("../DAVID/C3_expressed_background.txt", sep = "\t")
cluster4 <- read.csv("../DAVID/C4_expressed_background.txt", sep = "\t")
# get significant terms
cluster1_sig <- cluster1 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster1_terms <- cluster1_sig$Term) [1] "KW-0770~Synapse"
[2] "KW-0628~Postsynaptic cell membrane"
[3] "KW-1003~Cell membrane"
[4] "hsa04727:GABAergic synapse"
[5] "hsa04723:Retrograde endocannabinoid signaling"
[6] "KW-0325~Glycoprotein"
[7] "IPR018000:Neurotransmitter-gated ion-channel, conserved site"
[8] "IPR006029:Neurotransmitter-gated ion-channel transmembrane domain"
[9] "IPR006202:Neurotransmitter-gated ion-channel ligand-binding"
[10] "IPR006201:Neurotransmitter-gated ion-channel"
[11] "GO:0030594~neurotransmitter receptor activity"
[12] "hsa05033:Nicotine addiction"
[13] "hsa05032:Morphine addiction"
[14] "GO:0045202~synapse"
[15] "KW-1015~Disulfide bond"
[16] "GO:0050877~neurological system process"
[17] "GO:0007268~chemical synaptic transmission"
[18] "KW-1071~Ligand-gated ion channel"
[19] "hsa04724:Glutamatergic synapse"
[20] "GO:0045211~postsynaptic membrane"
[21] "KW-0732~Signal"
[22] "DOMAIN:Neurotransmitter-gated ion-channel transmembrane"
[23] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[24] "GO:0005230~extracellular ligand-gated ion channel activity"
[25] "DOMAIN:Neurotransmitter-gated ion-channel ligand-binding"
[26] "KW-0472~Membrane"
[27] "GO:0099060~integral component of postsynaptic specialization membrane"
[28] "GO:0005886~plasma membrane"
[29] "hsa04725:Cholinergic synapse"
[30] "KW-0429~Leber hereditary optic neuropathy"
cluster2_sig <- cluster2 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster2_terms <- cluster2_sig$Term) [1] "KW-0325~Glycoprotein"
[2] "KW-1015~Disulfide bond"
[3] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[4] "KW-0732~Signal"
[5] "KW-0964~Secreted"
[6] "GO:0005615~extracellular space"
[7] "KW-0217~Developmental protein"
[8] "KW-9996~Developmental protein"
[9] "GO:0005886~plasma membrane"
[10] "GO:0005576~extracellular region"
[11] "KW-0371~Homeobox"
[12] "IPR020479:Homeodomain, metazoa"
[13] "IPR001356:Homeodomain"
[14] "IPR017970:Homeobox, conserved site"
[15] "SM00389:HOX"
[16] "TOPO_DOM:Extracellular"
[17] "DNA_BIND:Homeobox"
[18] "KW-0272~Extracellular matrix"
[19] "GO:0005887~integral component of plasma membrane"
[20] "IPR009057:Homeodomain-like"
[21] "GO:0001525~angiogenesis"
[22] "GO:1990837~sequence-specific double-stranded DNA binding"
[23] "KW-0106~Calcium"
[24] "GO:0009952~anterior/posterior pattern specification"
[25] "GO:0031012~extracellular matrix"
[26] "KW-0130~Cell adhesion"
[27] "KW-0675~Receptor"
[28] "KW-1003~Cell membrane"
[29] "GO:0009986~cell surface"
[30] "IPR001827:Homeobox protein, antennapedia type, conserved site"
[31] "GO:0005201~extracellular matrix structural constituent"
[32] "TOPO_DOM:Cytoplasmic"
[33] "GO:0005788~endoplasmic reticulum lumen"
[34] "GO:0009897~external side of plasma membrane"
[35] "GO:0007155~cell adhesion"
[36] "GO:0001501~skeletal system development"
[37] "DOMAIN:Homeobox"
[38] "GO:0007165~signal transduction"
[39] "GO:0000785~chromatin"
[40] "IPR013783:Immunoglobulin-like fold"
[41] "GO:0048704~embryonic skeletal system morphogenesis"
[42] "MOTIF:Antp-type hexapeptide"
[43] "IPR000742:Epidermal growth factor-like domain"
[44] "KW-0037~Angiogenesis"
[45] "GO:0001570~vasculogenesis"
[46] "IPR001881:EGF-like calcium-binding"
[47] "KW-0245~EGF-like domain"
[48] "IPR009030:Insulin-like growth factor binding protein, N-terminal"
[49] "hsa04611:Platelet activation"
[50] "KW-0393~Immunoglobulin domain"
[51] "SM00181:EGF"
[52] "IPR003599:Immunoglobulin subtype"
[53] "DOMAIN:EGF-like 1"
[54] "GO:0043235~receptor complex"
[55] "IPR017995:Homeobox protein, antennapedia type"
[56] "GO:0045766~positive regulation of angiogenesis"
[57] "GO:0030198~extracellular matrix organization"
[58] "GO:0010628~positive regulation of gene expression"
[59] "GO:0005509~calcium ion binding"
[60] "SM00179:EGF_CA"
[61] "IPR007110:Immunoglobulin-like domain"
[62] "KW-0165~Cleavage on pair of basic residues"
[63] "GO:0003180~aortic valve morphogenesis"
[64] "IPR018097:EGF-like calcium-binding, conserved site"
[65] "DOMAIN:EGF-like"
[66] "GO:0001228~transcriptional activator activity, RNA polymerase II transcription regulatory region sequence-specific binding"
[67] "GO:0045944~positive regulation of transcription from RNA polymerase II promoter"
[68] "IPR000152:EGF-type aspartate/asparagine hydroxylation site"
[69] "hsa04610:Complement and coagulation cascades"
[70] "GO:0005925~focal adhesion"
[71] "GO:0007267~cell-cell signaling"
[72] "GO:0007179~transforming growth factor beta receptor signaling pathway"
[73] "GO:0014068~positive regulation of phosphatidylinositol 3-kinase signaling"
[74] "GO:0030335~positive regulation of cell migration"
[75] "SM00409:IG"
[76] "GO:0006954~inflammatory response"
[77] "DOMAIN:Ig-like C2-type 2"
[78] "DOMAIN:Ig-like C2-type 1"
[79] "DOMAIN:EGF-like 2"
[80] "GO:0005518~collagen binding"
[81] "MOTIF:Cell attachment site"
[82] "GO:0001958~endochondral ossification"
[83] "GO:0030199~collagen fibril organization"
[84] "DOMAIN:EGF-like 6"
[85] "GO:0001649~osteoblast differentiation"
[86] "hsa04151:PI3K-Akt signaling pathway"
[87] "hsa04510:Focal adhesion"
[88] "GO:0008201~heparin binding"
[89] "GO:0001568~blood vessel development"
[90] "DOMAIN:EGF-like 3"
[91] "KW-0358~Heparin-binding"
[92] "KW-0176~Collagen"
[93] "GO:0050840~extracellular matrix binding"
[94] "GO:0051216~cartilage development"
[95] "hsa04512:ECM-receptor interaction"
[96] "GO:0007186~G-protein coupled receptor signaling pathway"
[97] "GO:0060065~uterus development"
[98] "DOMAIN:Ig-like"
[99] "GO:0008284~positive regulation of cell proliferation"
[100] "KW-0765~Sulfation"
[101] "KW-0221~Differentiation"
[102] "DOMAIN:EGF-like 5"
[103] "DOMAIN:EGF-like 4"
[104] "KW-0430~Lectin"
[105] "GO:0034446~substrate adhesion-dependent cell spreading"
[106] "KW-0472~Membrane"
[107] "GO:0032967~positive regulation of collagen biosynthetic process"
[108] "GO:0003148~outflow tract septum morphogenesis"
[109] "hsa05150:Staphylococcus aureus infection"
[110] "hsa04060:Cytokine-cytokine receptor interaction"
[111] "IPR003598:Immunoglobulin subtype 2"
[112] "GO:0051897~positive regulation of protein kinase B signaling"
[113] "GO:0005178~integrin binding"
[114] "GO:0030509~BMP signaling pathway"
[115] "GO:0048844~artery morphogenesis"
[116] "GO:0030154~cell differentiation"
[117] "GO:2000352~negative regulation of endothelial cell apoptotic process"
[118] "GO:0035115~embryonic forelimb morphogenesis"
[119] "GO:0005581~collagen trimer"
[120] "GO:0045121~membrane raft"
[121] "GO:0019955~cytokine binding"
[122] "GO:0050679~positive regulation of epithelial cell proliferation"
[123] "hsa04015:Rap1 signaling pathway"
[124] "GO:0055010~ventricular cardiac muscle tissue morphogenesis"
[125] "GO:0060038~cardiac muscle cell proliferation"
[126] "GO:0016477~cell migration"
[127] "GO:0060021~palate development"
[128] "GO:0003700~transcription factor activity, sequence-specific DNA binding"
[129] "GO:0048407~platelet-derived growth factor binding"
[130] "GO:0002062~chondrocyte differentiation"
[131] "GO:0007596~blood coagulation"
[132] "GO:0035924~cellular response to vascular endothelial growth factor stimulus"
[133] "GO:0070374~positive regulation of ERK1 and ERK2 cascade"
[134] "GO:0006955~immune response"
[135] "GO:0043627~response to estrogen"
[136] "GO:0007507~heart development"
[137] "GO:0009954~proximal/distal pattern formation"
[138] "GO:0008285~negative regulation of cell proliferation"
[139] "GO:0007160~cell-matrix adhesion"
[140] "KW-0654~Proteoglycan"
[141] "IPR000372:Leucine-rich repeat-containing N-terminal"
[142] "GO:0060395~SMAD protein signal transduction"
[143] "GO:0016525~negative regulation of angiogenesis"
[144] "GO:0035025~positive regulation of Rho protein signal transduction"
[145] "GO:0051145~smooth muscle cell differentiation"
[146] "GO:0038023~signaling receptor activity"
[147] "GO:0003007~heart morphogenesis"
[148] "GO:0014911~positive regulation of smooth muscle cell migration"
[149] "hsa05144:Malaria"
[150] "GO:0010718~positive regulation of epithelial to mesenchymal transition"
[151] "GO:0001837~epithelial to mesenchymal transition"
[152] "GO:0001503~ossification"
[153] "GO:0008217~regulation of blood pressure"
[154] "GO:0031100~animal organ regeneration"
[155] "GO:0002053~positive regulation of mesenchymal cell proliferation"
[156] "GO:0010463~mesenchymal cell proliferation"
[157] "IPR008160:Collagen triple helix repeat"
[158] "hsa05152:Tuberculosis"
[159] "GO:0001938~positive regulation of endothelial cell proliferation"
[160] "GO:0000976~transcription regulatory region sequence-specific DNA binding"
[161] "GO:0005796~Golgi lumen"
[162] "GO:0009611~response to wounding"
[163] "GO:0050729~positive regulation of inflammatory response"
[164] "GO:0001656~metanephros development"
[165] "SM00013:LRRNT"
[166] "GO:0010629~negative regulation of gene expression"
[167] "GO:0015026~coreceptor activity"
[168] "SM00408:IGc2"
[169] "TRANSMEM:Helical"
[170] "GO:0003151~outflow tract morphogenesis"
[171] "IPR016186:C-type lectin-like"
[172] "hsa05200:Pathways in cancer"
[173] "hsa04380:Osteoclast differentiation"
[174] "GO:0005102~receptor binding"
[175] "GO:0007275~multicellular organism development"
[176] "GO:0016021~integral component of membrane"
[177] "GO:0090050~positive regulation of cell migration involved in sprouting angiogenesis"
[178] "GO:0030246~carbohydrate binding"
[179] "GO:0000981~RNA polymerase II transcription factor activity, sequence-specific DNA binding"
[180] "GO:0001822~kidney development"
[181] "GO:0072562~blood microparticle"
[182] "GO:0031093~platelet alpha granule lumen"
[183] "GO:0071773~cellular response to BMP stimulus"
[184] "GO:0042475~odontogenesis of dentin-containing tooth"
[185] "GO:0045599~negative regulation of fat cell differentiation"
[186] "GO:0045165~cell fate commitment"
[187] "GO:0048146~positive regulation of fibroblast proliferation"
[188] "GO:0000978~RNA polymerase II core promoter proximal region sequence-specific DNA binding"
[189] "GO:0004888~transmembrane signaling receptor activity"
[190] "GO:0060412~ventricular septum morphogenesis"
[191] "KW-0297~G-protein coupled receptor"
[192] "KW-0807~Transducer"
[193] "GO:0007229~integrin-mediated signaling pathway"
[194] "hsa04010:MAPK signaling pathway"
[195] "hsa05142:Chagas disease"
[196] "KW-0356~Hemostasis"
[197] "KW-0094~Blood coagulation"
[198] "KW-0892~Osteogenesis"
[199] "KW-0965~Cell junction"
[200] "GO:0048706~embryonic skeletal system development"
[201] "GO:0001666~response to hypoxia"
[202] "GO:0045746~negative regulation of Notch signaling pathway"
[203] "DOMAIN:EGF-like 2; calcium-binding"
[204] "hsa05202:Transcriptional misregulation in cancer"
[205] "KW-0395~Inflammatory response"
[206] "GO:0042127~regulation of cell proliferation"
[207] "GO:0050728~negative regulation of inflammatory response"
[208] "GO:0042060~wound healing"
[209] "IPR017452:GPCR, rhodopsin-like, 7TM"
[210] "KW-0391~Immunity"
[211] "KW-0873~Pyrrolidone carboxylic acid"
[212] "hsa04020:Calcium signaling pathway"
[213] "hsa04514:Cell adhesion molecules"
[214] "KW-0336~GPI-anchor"
[215] "hsa04350:TGF-beta signaling pathway"
[216] "GO:0061036~positive regulation of cartilage development"
[217] "hsa05032:Morphine addiction"
[218] "GO:0043410~positive regulation of MAPK cascade"
[219] "IPR000276:G protein-coupled receptor, rhodopsin-like"
[220] "IPR003961:Fibronectin, type III"
[221] "GO:0060045~positive regulation of cardiac muscle cell proliferation"
[222] "GO:0007169~transmembrane receptor protein tyrosine kinase signaling pathway"
[223] "GO:0030097~hemopoiesis"
[224] "GO:0006935~chemotaxis"
[225] "GO:0005539~glycosaminoglycan binding"
[226] "IPR016187:C-type lectin fold"
[227] "GO:0048010~vascular endothelial growth factor receptor signaling pathway"
[228] "GO:0035987~endodermal cell differentiation"
[229] "GO:0048008~platelet-derived growth factor receptor signaling pathway"
[230] "GO:0048286~lung alveolus development"
[231] "DOMAIN:Fibronectin type-III"
[232] "GO:0035579~specific granule membrane"
[233] "GO:0060216~definitive hemopoiesis"
[234] "GO:0040037~negative regulation of fibroblast growth factor receptor signaling pathway"
[235] "CARBOHYD:N-linked (GlcNAc...) (complex) asparagine"
[236] "GO:1905370~serine-type endopeptidase complex"
[237] "hsa04979:Cholesterol metabolism"
[238] "GO:0071560~cellular response to transforming growth factor beta stimulus"
[239] "GO:0030168~platelet activation"
[240] "GO:0010595~positive regulation of endothelial cell migration"
[241] "IPR013106:Immunoglobulin V-set"
[242] "IPR013151:Immunoglobulin"
[243] "GO:0001934~positive regulation of protein phosphorylation"
[244] "GO:0032731~positive regulation of interleukin-1 beta production"
[245] "IPR000047:Helix-turn-helix motif"
[246] "IPR001304:C-type lectin"
[247] "GO:0033089~positive regulation of T cell differentiation in thymus"
[248] "GO:0070062~extracellular exosome"
[249] "hsa05205:Proteoglycans in cancer"
[250] "hsa04974:Protein digestion and absorption"
cluster3_sig <- cluster3 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster3_terms <- cluster3_sig$Term) [1] "GO:0005886~plasma membrane"
[2] "KW-0325~Glycoprotein"
[3] "KW-1015~Disulfide bond"
[4] "CARBOHYD:N-linked (GlcNAc...) asparagine"
[5] "KW-0472~Membrane"
[6] "KW-1003~Cell membrane"
[7] "KW-0732~Signal"
[8] "TOPO_DOM:Cytoplasmic"
[9] "KW-0130~Cell adhesion"
[10] "KW-0964~Secreted"
[11] "TOPO_DOM:Extracellular"
[12] "GO:0009986~cell surface"
[13] "KW-0037~Angiogenesis"
[14] "GO:0016021~integral component of membrane"
[15] "GO:0070062~extracellular exosome"
[16] "GO:0005615~extracellular space"
[17] "GO:0005576~extracellular region"
[18] "KW-0106~Calcium"
[19] "GO:0007165~signal transduction"
[20] "GO:0007155~cell adhesion"
[21] "TRANSMEM:Helical"
[22] "GO:0005788~endoplasmic reticulum lumen"
[23] "KW-1133~Transmembrane helix"
[24] "KW-0812~Transmembrane"
[25] "GO:0009897~external side of plasma membrane"
[26] "IPR011993:Pleckstrin homology-like domain"
[27] "GO:0016020~membrane"
[28] "KW-0768~Sushi"
[29] "hsa04510:Focal adhesion"
[30] "KW-0735~Signal-anchor"
[31] "KW-0344~Guanine-nucleotide releasing factor"
[32] "KW-0245~EGF-like domain"
[33] "KW-0084~Basement membrane"
[34] "GO:0007160~cell-matrix adhesion"
[35] "hsa05200:Pathways in cancer"
[36] "hsa04015:Rap1 signaling pathway"
[37] "hsa04014:Ras signaling pathway"
[38] "hsa04640:Hematopoietic cell lineage"
[39] "GO:0005856~cytoskeleton"
[40] "KW-0034~Amyloid"
[41] "GO:0007264~small GTPase mediated signal transduction"
[42] "GO:0001525~angiogenesis"
[43] "hsa04512:ECM-receptor interaction"
[44] "KW-0272~Extracellular matrix"
[45] "KW-0391~Immunity"
[46] "GO:0005794~Golgi apparatus"
[47] "hsa04148:Efferocytosis"
[48] "GO:0005925~focal adhesion"
cluster4_sig <- cluster4 %>%
filter(Benjamini < 0.01) %>%
arrange(Benjamini)
(cluster4_terms <- cluster4_sig$Term) [1] "GO:0005730~nucleolus"
[2] "GO:0003723~RNA binding"
[3] "GO:0005654~nucleoplasm"
[4] "KW-0539~Nucleus"
[5] "GO:0005694~chromosome"
[6] "GO:0006364~rRNA processing"
[7] "KW-0597~Phosphoprotein"
[8] "KW-0007~Acetylation"
[9] "KW-0698~rRNA processing"
[10] "KW-0690~Ribosome biogenesis"
[11] "KW-0235~DNA replication"
[12] "GO:0032040~small-subunit processome"
[13] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2)"
[14] "KW-0067~ATP-binding"
[15] "GO:0006260~DNA replication"
[16] "KW-0158~Chromosome"
[17] "KW-0809~Transit peptide"
[18] "KW-0694~RNA-binding"
[19] "KW-0832~Ubl conjugation"
[20] "KW-1017~Isopeptide bond"
[21] "hsa04110:Cell cycle"
[22] "KW-0347~Helicase"
[23] "KW-0496~Mitochondrion"
[24] "GO:0005524~ATP binding"
[25] "GO:0016887~ATPase activity"
[26] "hsa03030:DNA replication"
[27] "KW-0853~WD repeat"
[28] "GO:0005634~nucleus"
[29] "GO:0042274~ribosomal small subunit biogenesis"
[30] "KW-0547~Nucleotide-binding"
[31] "hsa03008:Ribosome biogenesis in eukaryotes"
[32] "GO:0005739~mitochondrion"
[33] "GO:0006268~DNA unwinding involved in DNA replication"
[34] "KW-0131~Cell cycle"
[35] "GO:0000462~maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[36] "TRANSIT:Mitochondrion"
[37] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO2); alternate"
[38] "REPEAT:WD 2"
[39] "REPEAT:WD 1"
[40] "REPEAT:WD 3"
[41] "SM00490:HELICc"
[42] "REPEAT:WD 6"
[43] "GO:0071162~CMG complex"
[44] "DOMAIN:Helicase ATP-binding"
[45] "REPEAT:WD 5"
[46] "REPEAT:WD 4"
[47] "GO:0030687~preribosome, large subunit precursor"
[48] "SM00320:WD40"
[49] "DOMAIN:Helicase C-terminal"
[50] "GO:0042555~MCM complex"
[51] "KW-0498~Mitosis"
[52] "GO:0017116~single-stranded DNA-dependent ATP-dependent DNA helicase activity"
[53] "REPEAT:WD 7"
[54] "COMPBIAS:Acidic residues"
[55] "SM00487:DEXDc"
[56] "GO:0003678~DNA helicase activity"
[57] "GO:0003688~DNA replication origin binding"
[58] "SM00350:MCM"
[59] "KW-0658~Purine biosynthesis"
[60] "GO:0006270~DNA replication initiation"
[61] "GO:0042273~ribosomal large subunit biogenesis"
[62] "IPR001650:Helicase, C-terminal"
[63] "IPR012340:Nucleic acid-binding, OB-fold"
[64] "GO:0044208~'de novo' AMP biosynthetic process"
[65] "DOMAIN:MCM"
[66] "IPR001680:WD40 repeat"
[67] "GO:0042645~mitochondrial nucleoid"
[68] "IPR014001:Helicase, superfamily 1/2, ATP-binding domain"
[69] "IPR001208:Mini-chromosome maintenance, DNA-dependent ATPase"
[70] "GO:0003724~RNA helicase activity"
[71] "GO:0030174~regulation of DNA-dependent DNA replication initiation"
[72] "GO:0030515~snoRNA binding"
[73] "GO:0000398~mRNA splicing, via spliceosome"
[74] "IPR018525:Mini-chromosome maintenance, conserved site"
[75] "IPR015943:WD40/YVTN repeat-like-containing domain"
[76] "CROSSLNK:Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in SUMO1); alternate"
[77] "KW-0489~Methyltransferase"
[78] "GO:0006189~'de novo' IMP biosynthetic process"
[79] "hsa03430:Mismatch repair"
[80] "IPR011545:DNA/RNA helicase, DEAD/DEAH box type, N-terminal"
[81] "KW-0137~Centromere"
[82] "KW-0132~Cell division"
[83] "GO:0005759~mitochondrial matrix"
[84] "GO:0006281~DNA repair"
[85] "COMPBIAS:Basic and acidic residues"
[86] "IPR027417:P-loop containing nucleoside triphosphate hydrolase"
[87] "hsa03013:Nucleocytoplasmic transport"
[88] "GO:0003682~chromatin binding"
[89] "GO:0051301~cell division"
[90] "GO:0097294~'de novo' XMP biosynthetic process"
[91] "GO:0004386~helicase activity"
[92] "GO:0030532~small nuclear ribonucleoprotein complex"
[93] "GO:0000727~double-strand break repair via break-induced replication"
[94] "GO:0006397~mRNA processing"
[95] "KW-0436~Ligase"
[96] "KW-0949~S-adenosyl-L-methionine"
[97] "GO:0006271~DNA strand elongation involved in DNA replication"
[98] "KW-0227~DNA damage"
[99] "KW-0747~Spliceosome"
[100] "h_mcmPathway:CDK Regulation of DNA Replication"
[101] "IPR019775:WD40 repeat, conserved site"
[102] "KW-0507~mRNA processing"
[103] "MOTIF:Arginine finger"
[104] "GO:0005515~protein binding"
[105] "GO:0006177~GMP biosynthetic process"
[106] "GO:0000466~maturation of 5.8S rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[107] "KW-0234~DNA repair"
[108] "GO:0000463~maturation of LSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)"
[109] "KW-0508~mRNA splicing"
[110] "MOTIF:Nuclear localization signal"
[111] "GO:0003697~single-stranded DNA binding"
[112] "KW-0030~Aminoacyl-tRNA synthetase"
[113] "hsa03410:Base excision repair"
[114] "GO:0045943~positive regulation of transcription from RNA polymerase I promoter"
[115] "GO:0000781~chromosome, telomeric region"
[116] "GO:0019899~enzyme binding"
[117] "KW-0819~tRNA processing"
[118] "GO:0005743~mitochondrial inner membrane"
[119] "KW-1135~Mitochondrion nucleoid"
[120] "KW-0995~Kinetochore"
[121] "GO:0006164~purine nucleotide biosynthetic process"
[122] "hsa00670:One carbon pool by folate"
[123] "REPEAT:WD 13"
[124] "GO:0016604~nuclear body"
[125] "MOTIF:Q motif"
[126] "GO:0071013~catalytic step 2 spliceosome"
[127] "GO:0008168~methyltransferase activity"
[128] "SM00360:RRM"
[129] "GO:1990904~ribonucleoprotein complex"
[130] "REPEAT:WD 8"
The significant results for gene set 1 include associations with neurological functions and pathways, for instance signalling at the neurological synapse.
For gene set 2, many significant terms are found (250), which might be due to the large size of this gene set. The most significant terms included for instance Glycoprotein, Developmental Protein, Homeobox/domain and Extracellular Matrix.
The results of gene set 3 include associations with the cell surface and the cell membrane as well as cellular signalling and cell adhesion.
Lastly, the results of gene sets 4 again include more terms then the ones of gene sets 1 and 3, but less then gene set 2. This seems to be surprising when just looking at the sizes of the genes sets, since gene set 4 is with 680 genes by far the smallest gene set. However, as described above, the average expression of genes cluster 4 does not follow a trend similar to what can be observed in the other clusters, but it seems to fluctuate over the course of time. This might be be caused by cluster 4 being more heterogeneous than the other clusters, which could also explain why many more terms were found for this cluster than for clusters 1 and 3, even though they are larger. The significant terms of gene set 4 include functions related to the nucleus, DNA replication, transcription and translation.
When looking at all functions or pathways that DAVID found an association with, it seems like many different cellular components are involved in the changes over the course of time. …
GSEA (ranked-based)
genesets <- msigdbr(species = "Homo sapiens", category = "C8")
genesets_list <- tapply(genesets$ensembl_gene, genesets$gs_name, list)
fgsea_kegg <- fgsea(
pathways = genesets_list,
stats = ranked_list,
scoreType = "pos",
minSize = 15,
maxSize = 500
)Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (26.66% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
pathway pval padj
1: FAN_EMBRYONIC_CTX_NSC_1 5.815903e-04 0.0253407205
2: DESCARTES_FETAL_LUNG_VASCULAR_ENDOTHELIAL_CELLS 7.572261e-06 0.0007698466
3: TRAVAGLINI_LUNG_CLUB_CELL 2.390413e-06 0.0003645379
4: BUSSLINGER_GASTRIC_LYZ_POSITIVE_CELLS 1.912876e-05 0.0016669345
5: RUBENSTEIN_SKELETAL_MUSCLE_B_CELLS 2.181677e-07 0.0001330823
6: TRAVAGLINI_LUNG_CD4_NAIVE_T_CELL 5.772218e-06 0.0007042106
7: TRAVAGLINI_LUNG_CD8_MEMORY_EFFECTOR_T_CELL 1.203137e-02 0.2223979688
8: FAN_OVARY_CL0_XBP1_SELK_HIGH_STROMAL_CELL 2.012081e-06 0.0003645379
9: RUBENSTEIN_SKELETAL_MUSCLE_T_CELLS 4.895747e-07 0.0001493203
10: AIZARANI_LIVER_C12_NK_NKT_CELLS_4 5.019433e-03 0.1093519347
log2err ES NES size
1: 0.4772708 0.7915633 1.441168 17
2: 0.6105269 0.7044832 1.372288 46
3: 0.6272567 0.6366332 1.284880 106
4: 0.5756103 0.6283243 1.264615 98
5: 0.6901325 0.6118143 1.249911 163
6: 0.6105269 0.6153934 1.249589 128
7: 0.3807304 0.6579373 1.241902 26
8: 0.6272567 0.6057760 1.233199 143
9: 0.6594444 0.5981993 1.224873 177
10: 0.4070179 0.6299482 1.224036 44